<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>生存分析预测 | 阿力木·达依木|Alimu Dayimu</title>
    <link rel="stylesheet" href="https://alim.gitee.io/css/style.css" />
    <link rel="stylesheet" href="https://alim.gitee.io/css/fonts.css" />
    
<link href="//cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.6.0/build/styles/github.min.css" rel="stylesheet">


    


<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.0/dist/katex.min.css" integrity="sha384-Xi8rHCmBmhbuyyhbI88391ZKP2dmfnOl4rT9ZfRI7mLTdk1wblIUnrIq35nqwEvC" crossorigin="anonymous">
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.0/dist/katex.min.js" integrity="sha384-X/XCfMm41VSsqRNQgDerQczD69XqmjOOOwYQvr/uuC+j4OPoNhVgjdGFwhvN02Ja" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.0/dist/contrib/auto-render.min.js" integrity="sha384-+XBljXPPiv+OzfbB3cVmLHf4hdUFHlWNZN5spNQ7rmHTXpd7WvJum6fIACpNNfIR" crossorigin="anonymous"></script>
<script>
    document.addEventListener("DOMContentLoaded", function() {
        renderMathInElement(document.body, {
          
          
          delimiters: [
          {left: "$$", right: "$$", display: true},
          {left: "$", right: "$", display: false},
          {left: "\\(", right: "\\)", display: false},
          {left: "\\[", right: "\\]", display: true},
          {left: "\\begin{equation}", right: "\\end{equation}", display: true},
          {left: "\\begin{align}", right: "\\end{align}", display: true},
          {left: "\\begin{alignat}", right: "\\end{alignat}", display: true},
          {left: "\\begin{gather}", right: "\\end{gather}", display: true},
          {left: "\\begin{CD}", right: "\\end{CD}", display: true},
          {left: "\\[", right: "\\]", display: true}
          ],
          
          throwOnError : false
        });
    });
</script>


  </head>

  <body>
    <nav>
    <ul class="menu">
      
      <li><a href="https://alim.gitee.io/">Home</a></li>
      
      <li><a href="https://alim.gitee.io/post/">Post</a></li>
      
      <li><a href="https://alim.gitee.io/talks/">Talks</a></li>
      
      <li><a href="https://alim.gitee.io/about/">About</a></li>
      
    </ul>
    <hr/>
    </nav>

<div class="article-meta">
<h1 class="title">生存分析预测</h1>
<div class="meta">
<span class="author">Alimu Dayimu</span>
<span class="date middot">2022/09/09</span>

<span class="reading-time middot"> 5 min read </span>
<div class="terms">
  
  
  
  
  
</div>
</div>
</div>




<main>



<p>医学研究中经常会用到生存分析问题，看到生存分析很多人立马想到的就是Cox等比例风险模型。但是作为半参数模型的Cox模型最大的优点就是不需要估计基准风向<span class="math inline">\(h_0(t)\)</span>（有些地方写作<span class="math inline">\(\lambda_0(t)\)</span>），但是最大的缺点也是不估计基准风险函数。如果目的是估计某个变量的风险比的话，Cox模型能够完全满足要求，因为不需要计算绝对风险。但是如果研究目风险预测（计算绝对风险），尤其是需要对结果进行外推的话，那么Cox模型可能就不合适了。</p>
<div id="问题" class="section level1">
<h1>问题</h1>
<p>考虑<a href="https://andrew-leroux.github.io/rnhanesdata/">NHANES</a>数据，数据包括20到85岁人群的生存数据，其中包括一些协变量，结局可是死亡。我们想拟合模型，并将这个模型用在其他的新人群上，预测这些新的人群中个体的绝对风险以及这些人还能活多长时间（life expectancy）。下面是数据：</p>
<pre class="r"><code>load(&#39;NHANES.Rdata&#39;)
df &lt;- df[df$age &gt;= 20 &amp; !is.na(df$age), ]
df &lt;- na.omit(df) # Delete rows with NA
df$age_fin &lt;- df$age + df$permth_int/12</code></pre>
</div>
<div id="思路" class="section level1">
<h1>思路</h1>
<p>在开始的时候我们需要考虑怎么生存模型的时间尺度如何选择，我们可以简单的使用随访时间来计算我们的模型，但是我们预测不同年龄段的风险的时候可能会存在问题。风险是随着年龄变化的，而不是随着随访时间变化的，对于一个20岁的人随访5年以后风险水平与70岁的人随访5年以后两者的基准风险肯定不同。因此，我们这里考虑采用年龄作为我们的时间尺度<span class="citation">(<a href="#ref-kom1997time" role="doc-biblioref">Kom, Graubard, and Midthune 1997</a>)</span>。</p>
<p>而后因为这个是风险预测的问题，我们首先就需要排除Cox模型，参数模型在这个时候是更好的选择。参数模型有Exponential、Weibull、Gompertz、Gamma、Lognormal、Log-logistic以及Generalized gamma等分布，我们需要考虑选择何种风险分布。首先我们可以考虑排除参数模型有Exponential，期风险是恒定的，根据我们的假设，我们需要风险随着年龄的升高而升高，并且年龄越大死亡风险越高。首先我们可以看看这些不同的分布的风险随着年龄的变化是怎么变化的。</p>
<pre class="r"><code>library(flexsurv)
library(ggplot2)
library(data.table)

dists &lt;- c(&quot;exp&quot;, &quot;weibull&quot;, &quot;gompertz&quot;, &quot;gamma&quot;, 
           &quot;lognormal&quot;, &quot;llogis&quot;, &quot;gengamma&quot;)
dists_long &lt;- c(&quot;Exponential&quot;, &quot;Weibull (AFT)&quot;,
                &quot;Gompertz&quot;, &quot;Gamma&quot;, &quot;Lognormal&quot;, &quot;Log-logistic&quot;,
                &quot;Generalized gamma&quot;)
parametric_haz &lt;- vector(mode = &quot;list&quot;, length = length(dists))
for (i in 1:length(dists)){
  fit &lt;- flexsurvreg(Surv(age, age_fin, mortstat) ~ 1, 
                     data = df, dist = dists[i]) 
  parametric_haz[[i]] &lt;- summary(fit, type = &quot;hazard&quot;, 
                                     ci = FALSE, tidy = TRUE)
  parametric_haz[[i]]$method &lt;- dists_long[i]
}
parametric_haz &lt;- rbindlist(parametric_haz)

n_dists &lt;- length(dists) 
ggplot(parametric_haz, aes(x = time, y = est, 
                           col = method, linetype = method)) +
  geom_line() +
  labs(x = &quot;Age&quot;, y = &quot;Hazard&quot;) + 
  scale_colour_brewer(palette = &quot;Set1&quot;)</code></pre>
<p><img src="https://alim.gitee.io/post/2022/survival-prediction/index_files/figure-html/baseline-surv-1.png" width="672" /></p>
<p>通过上面的结果我们发现，除了Exponential分布的风险不随年龄变化之外其余均随年龄的增长而升高，尤其是Generalized gamma、Gompertz和Weibull比较符合预期。因为Generalized gamma有三个参数需要估计，并且为了方便我们计算预期寿命，最后根据实际，我们考虑Gompertz作为我们的风险分布。当然，这个选择过程也可以完全按照AIC或者是BIC来进行。</p>
<p>上面模型构建的策略是我们根据实际问题以及数据的情况结合选择的，任何模型的创建都应该根据上面的原则进行。</p>
</div>
<div id="模型拟合" class="section level1">
<h1>模型拟合</h1>
<p>下面我们拟合模型</p>
<pre class="r"><code>fit &lt;- flexsurvreg(Surv(age, age_fin, mortstat) ~ bmi + 
                     gender + race + chf + diabetes + chd +
                     cancer + stroke + educationadult +
                     mobilityproblem + drinkstatus + smokecigs,
                   data = df, dist = &quot;gompertz&quot;) 
fit</code></pre>
<pre><code>## Call:
## flexsurvreg(formula = Surv(age, age_fin, mortstat) ~ bmi + gender + 
##     race + chf + diabetes + chd + cancer + stroke + educationadult + 
##     mobilityproblem + drinkstatus + smokecigs, data = df, dist = &quot;gompertz&quot;)
## 
## Estimates: 
##                                                   data mean  est      
## shape                                                    NA   8.05e-02
## rate                                                     NA   7.85e-05
## bmi                                                2.85e+01  -1.20e-02
## genderFemale                                       5.16e-01  -4.81e-01
## raceMexican American                               1.98e-01  -2.66e-02
## raceOther Hispanic                                 2.92e-02  -6.43e-02
## raceBlack                                          1.92e-01   1.01e-01
## raceOther                                          3.96e-02   8.59e-02
## chfYes                                             3.58e-02   4.70e-01
## chfDon&#39;t know                                      4.50e-03  -7.58e-02
## diabetesYes                                        1.08e-01   3.45e-01
## diabetesBorderline                                 1.42e-02   1.18e-01
## chdYes                                             5.19e-02   1.65e-01
## chdDon&#39;t know                                      4.50e-03   5.57e-01
## cancerYes                                          9.58e-02   3.84e-01
## cancerDon&#39;t know                                   2.37e-03   7.54e-01
## strokeNo                                           9.60e-01  -1.73e-01
## strokeDon&#39;t know                                   1.19e-03   3.15e-01
## educationadult9-11th grade                         1.45e-01   2.26e-02
## educationadultHigh school grad/GED or equivalent   2.50e-01   1.35e-01
## educationadultSome College or AA degree            2.77e-01   4.43e-02
## educationadultCollege graduate or above            1.89e-01  -2.86e-01
## mobilityproblemAny Difficulty                      2.42e-01   5.76e-01
## drinkstatusNon-Drinker                             3.72e-01   2.90e-01
## drinkstatusHeavy Drinker                           6.33e-02   4.79e-01
## smokecigsFormer                                    2.76e-01   8.81e-02
## smokecigsCurrent                                   2.23e-01   6.51e-01
##                                                   L95%       U95%     
## shape                                              7.40e-02   8.70e-02
## rate                                               3.64e-05   1.70e-04
## bmi                                               -2.60e-02   1.94e-03
## genderFemale                                      -6.29e-01  -3.33e-01
## raceMexican American                              -2.40e-01   1.87e-01
## raceOther Hispanic                                -5.44e-01   4.16e-01
## raceBlack                                         -1.09e-01   3.10e-01
## raceOther                                         -3.29e-01   5.01e-01
## chfYes                                             2.39e-01   7.01e-01
## chfDon&#39;t know                                     -7.39e-01   5.88e-01
## diabetesYes                                        1.75e-01   5.16e-01
## diabetesBorderline                                -3.34e-01   5.70e-01
## chdYes                                            -4.76e-02   3.77e-01
## chdDon&#39;t know                                     -2.70e-02   1.14e+00
## cancerYes                                          2.12e-01   5.56e-01
## cancerDon&#39;t know                                  -4.14e-01   1.92e+00
## strokeNo                                          -3.97e-01   5.10e-02
## strokeDon&#39;t know                                  -7.06e-01   1.34e+00
## educationadult9-11th grade                        -2.05e-01   2.50e-01
## educationadultHigh school grad/GED or equivalent  -7.62e-02   3.47e-01
## educationadultSome College or AA degree           -1.77e-01   2.65e-01
## educationadultCollege graduate or above           -5.56e-01  -1.67e-02
## mobilityproblemAny Difficulty                      4.27e-01   7.24e-01
## drinkstatusNon-Drinker                             1.39e-01   4.41e-01
## drinkstatusHeavy Drinker                           1.73e-01   7.85e-01
## smokecigsFormer                                   -7.08e-02   2.47e-01
## smokecigsCurrent                                   4.34e-01   8.68e-01
##                                                   se         exp(est) 
## shape                                              3.31e-03         NA
## rate                                               3.08e-05         NA
## bmi                                                7.14e-03   9.88e-01
## genderFemale                                       7.54e-02   6.18e-01
## raceMexican American                               1.09e-01   9.74e-01
## raceOther Hispanic                                 2.45e-01   9.38e-01
## raceBlack                                          1.07e-01   1.11e+00
## raceOther                                          2.12e-01   1.09e+00
## chfYes                                             1.18e-01   1.60e+00
## chfDon&#39;t know                                      3.39e-01   9.27e-01
## diabetesYes                                        8.70e-02   1.41e+00
## diabetesBorderline                                 2.31e-01   1.13e+00
## chdYes                                             1.08e-01   1.18e+00
## chdDon&#39;t know                                      2.98e-01   1.75e+00
## cancerYes                                          8.76e-02   1.47e+00
## cancerDon&#39;t know                                   5.96e-01   2.13e+00
## strokeNo                                           1.14e-01   8.41e-01
## strokeDon&#39;t know                                   5.21e-01   1.37e+00
## educationadult9-11th grade                         1.16e-01   1.02e+00
## educationadultHigh school grad/GED or equivalent   1.08e-01   1.14e+00
## educationadultSome College or AA degree            1.13e-01   1.05e+00
## educationadultCollege graduate or above            1.37e-01   7.51e-01
## mobilityproblemAny Difficulty                      7.59e-02   1.78e+00
## drinkstatusNon-Drinker                             7.72e-02   1.34e+00
## drinkstatusHeavy Drinker                           1.56e-01   1.61e+00
## smokecigsFormer                                    8.11e-02   1.09e+00
## smokecigsCurrent                                   1.11e-01   1.92e+00
##                                                   L95%       U95%     
## shape                                                    NA         NA
## rate                                                     NA         NA
## bmi                                                9.74e-01   1.00e+00
## genderFemale                                       5.33e-01   7.16e-01
## raceMexican American                               7.86e-01   1.21e+00
## raceOther Hispanic                                 5.80e-01   1.52e+00
## raceBlack                                          8.97e-01   1.36e+00
## raceOther                                          7.19e-01   1.65e+00
## chfYes                                             1.27e+00   2.02e+00
## chfDon&#39;t know                                      4.77e-01   1.80e+00
## diabetesYes                                        1.19e+00   1.67e+00
## diabetesBorderline                                 7.16e-01   1.77e+00
## chdYes                                             9.53e-01   1.46e+00
## chdDon&#39;t know                                      9.73e-01   3.13e+00
## cancerYes                                          1.24e+00   1.74e+00
## cancerDon&#39;t know                                   6.61e-01   6.84e+00
## strokeNo                                           6.72e-01   1.05e+00
## strokeDon&#39;t know                                   4.94e-01   3.81e+00
## educationadult9-11th grade                         8.15e-01   1.28e+00
## educationadultHigh school grad/GED or equivalent   9.27e-01   1.41e+00
## educationadultSome College or AA degree            8.38e-01   1.30e+00
## educationadultCollege graduate or above            5.74e-01   9.83e-01
## mobilityproblemAny Difficulty                      1.53e+00   2.06e+00
## drinkstatusNon-Drinker                             1.15e+00   1.55e+00
## drinkstatusHeavy Drinker                           1.19e+00   2.19e+00
## smokecigsFormer                                    9.32e-01   1.28e+00
## smokecigsCurrent                                   1.54e+00   2.38e+00
## 
## N = 4218,  Events: 837,  Censored: 3381
## Total time at risk: 45671.92
## Log-likelihood = -3319.692, df = 27
## AIC = 6693.384</code></pre>
<p>模型拟合完后下一步将预测风险和预测期望寿命（life expectancy），对于期望寿命的计算我们采用<span class="citation">Castellares et al. (<a href="#ref-castellares2020closed" role="doc-biblioref">2020</a>)</span>提到的方法。</p>
<pre class="r"><code>surv_prob &lt;- function(x, newdata, time){
  # time
  t &lt;- newdata$age + time
  # Rate param
  rate &lt;- predict(fit, newdata = newdata, type = &quot;link&quot;)$.pred_link
  # Shape param
  pars &lt;- x$res.t[x$dlist$pars,&quot;est&quot;][1]
  ret &lt;- (1 - x$dfns$p(t, shape = pars, rate = rate))/
    (1 - x$dfns$p(0, shape = pars, rate = rate))
  ret[t &lt; 0] &lt;- 1 
  ret
}

life_exp &lt;- function(x, newdata){
  rate &lt;- predict(x, newdata = newdata, type = &quot;link&quot;)$.pred_link
  shape &lt;- x$res.t[x$dlist$pars,&quot;est&quot;][1]

  # Life expectation calculation
  le &lt;- function(t, a, b){
  (exp(a*exp(b*t)/b)/b)*expint::expint_E1(a*exp(b*t)/b)
  }

  le(newdata$age, rate, shape)
}</code></pre>
<p>下面假设新的数据是我们原来的数据，一般这个应该是测试数据（test data）。下面的结果中我们可以看到期望寿命和模型估计的生存年龄是有差异的，期望寿命相对来说更准确。同时还需要注意的是，这个模型里面纳入的变量并不一定完全合适，所以模型拟合的结果可能不一定很理想，在实际应用中需要考虑模型的合理性问题。</p>
<pre class="r"><code># 5年的生存风险
df$surv_prob_5year &lt;- surv_prob(fit, df, 5)
df$life_exp &lt;- life_exp(fit, newdata = df)
df$predict &lt;- predict(fit, newdata = df, type = &quot;response&quot;)$.pred_time
head(df[,c(&quot;age&quot;, &quot;age_fin&quot;, &quot;mortstat&quot;, &quot;surv_prob_5year&quot;, &quot;life_exp&quot;, &quot;predict&quot;)])</code></pre>
<pre><code>##    age  age_fin mortstat surv_prob_5year  life_exp  predict
## 5   55 66.25000        0      0.89843477 28.329840 80.55651
## 6   52 64.41667        0      0.88900211 27.338034 76.38904
## 11  83 85.00000        1      0.03799231  4.185851 66.32067
## 13  37 49.66667        0      0.97842741 45.784111 81.98390
## 14  33 45.91667        0      0.98117709 47.245666 79.54103
## 15  50 62.83333        0      0.93128054 32.623918 80.57829</code></pre>
</div>
<div id="参考文献" class="section level1 unnumbered">
<h1>参考文献</h1>
<div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-castellares2020closed" class="csl-entry">
Castellares, Fredy, Silvio C Patrı́cio, Artur J Lemonte, and Bernardo L Queiroz. 2020. <span>“On Closed-Form Expressions to Gompertz–Makeham Life Expectancy.”</span> <em>Theoretical Population Biology</em> 134: 53–60.
</div>
<div id="ref-kom1997time" class="csl-entry">
Kom, Edward L, Barry I Graubard, and Douglas Midthune. 1997. <span>“Time-to-Event Analysis of Longitudinal Follow-up of a Survey: Choice of the Time-Scale.”</span> <em>American Journal of Epidemiology</em> 145 (1): 72–80.
</div>
<div id="ref-missov2013gompertz" class="csl-entry">
Missov, Trifon I, and Adam Lenart. 2013. <span>“Gompertz–Makeham Life Expectancies: Expressions and Applications.”</span> <em>Theoretical Population Biology</em> 90: 29–35.
</div>
</div>
</div>

</main>


  <footer>
  <script src="//cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.6.0/build/highlight.min.js"></script>
<script src="//cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.6.0/build/languages/r.min.js"></script>

<script>
hljs.configure({languages: []});
hljs.initHighlightingOnLoad();
</script>

  
  <hr/>
  © <a href="https://alim.gitee.io/">Alimu Dayimu</a> 2021 - 2022 | <a href="https://creativecommons.org/licenses/by-nc-nd/4.0/">BY-NC-ND 4.0</a> | <a href="https://github.com/adayim">GitHub</a> | <a href="https://gitee.com/alim">gitee</a> | <a href="https://alim.gitee.io/index.xml">RSS</a>
  
  
  </footer>
  </body>
</html>
